Characterization and phylogenetic analysis of the complete mitochondrial genome sequence of Diospyros oleifera, the first representative from the family Ebenaceae

Plant mitochondrial genomes are a valuable source of genetic information for a better understanding of phylogenetic relationships. However, no mitochondrial genome of any species in Ebenaceae has been reported. In this study, we reported the first mitochondrial genome of an Ebenaceae model plant Diospyros oleifera. The mitogenome was 493,958 bp in length, contained 39 protein-coding genes, 27 transfer RNA genes, and 3 ribosomal RNA genes. The rps2 and rps11 genes were missing in the D. oleifera mt genome, while the rps10 gene was identified. The length of the repetitive sequence in the D. oleifera mt genome was 31 kb, accounting for 6.33%. A clear bias in RNA-editing sites were found in the D. oleifera mt genome. We also detected 28 chloroplast-derived fragments significantly associated with D. oleifera mt genes, indicating intracellular tRNA genes transferred frequently from chloroplasts to mitochondria in D. oleifera. Phylogenetic analysis based on the mt genomes of D. oleifera and 27 other taxa reflected the exact evolutionary and taxonomic status of D. oleifera. Ka/Ks analysis revealed that 95.16% of the protein-coding genes in the D. oleifera mt genome had undergone negative selections. But, the rearrangement of mitochondrial genes has been widely occur among D. oleifera and these observed species. These results will lay the foundation for identifying further evolutionary relationships within Ebenaceae.


Introduction
Mitochondria are the main organelle involved in energy metabolism in plants [1,2]. They supply ATP via oxidative phosphorylation for metabolism, cell differentiation, apoptosis, cell growth, and cell division and are abundant in energy-consuming tissues involved in essential biological functions [1,2,3,4]. Therefore, mitochondria play an important role in plant productivity and development [2,5,6]. According to endosymbiotic theory, plant mitochondria are believed to have descended from free-living bacteria-independent microorganisms, which explains the presence of their genomes [5,7].
During evolution, the plant mitochondrial (mt) genome underwent dramatic changes in, for example, the gene order, genome structure, and migration of sequences from other organelles [5,7,8,9]. Thus, plants have about 100-10,000 times larger and more structurally complex mitochondrial (mt) genomes than animals [10,11,12]. The mt genomes of plants demonstrate significant genome size variation, from 66 kb [13] to 11.3 Mb [14]; the number of protein-coding genes varies from 14 to 67 [15]; and the number of tRNA genes varies from 3 to 27 [9]. There are variations in mitochondrial genomes not only between plant species but also within the same species [9,12,16,17], in stark contrast to the conserved structure of plant chloroplast genomes [16,17,18]. Thus, mt genomes have been used as a valuable source of genetic information and for investigation of essential cellular processes in many phylogenetic studies [18,19,20,21].
While, these characteristics of plant mt genomes (bigger size, more structural complexity, and low conservation across species) make plant mitochondrial genome assembly difficult [1,8,10]. To date, more than 5000 plant chloroplast genomes have been sequenced, but only about 400 mt genome sequences are available (www.ncbi.nlm.nih.gov/geno me/organelle/, 11/11/2021). In addition, sequenced plants largely differ in their classification, and only three complete mitochondrial genomes of species from the order Ericales have been identified.
Diospyros L., from the Ebenaceae family, is a plant genus that includes over 500 species widely distributed across tropical and subtropical regions [22] and that is one of the largest angiosperm genera [23]. Among these species, Diospyros oleifera and Diospyros kaki have been cultivated as an important fruit crop in China, Korea, Japan for centuries, due to its edible fruit is rich in vitamins, sugars, nutrients and antioxidants vital for optimum health with various medicinal and chemical [24,25]. Morphological, molecular, and genomic studies have shown that D. oleifera can be used as a model plant [24,26]. Chloroplast genome sequencing has been performed in 15 species of Diospyros [26,27], and nuclear genome sequencing has been performed in D. oleifera [23,28] and Diospyros lotus [29,30]. However, to date, no mt genome of any species in Ebenaceae has been reported.
Fortunately, advancements in long read sequencing, such as PacBio and Oxford Nanopore, have made organelle genome sequencing easier and faster. Therefore, in this study, we constructed the complete mt genome of D. oleifera based on PacBio and Illumina data, performed a phylogenetic analysis, and compared the complete mt genomes of D. oleifera and related genera. These results will help better understand the features of the D. oleifera mitochondrial genome and lay the foundation for identifying further evolutionary relationships within Ebenaceae.

Samples and mitogenome sequencing
Due to the advancement of sequence technology, long reads, used for de novo assembly of organelle genomes without the need for organelle DNA isolation, could be easily generated from high throughput sequencing. The well-established methodology is quite efficient and well accepted in the scientific community [1,31,32,33,34,35,36,37,38,39,40,41,42].
In this study, Mature leaves of D. oleifera (at latitude 34.27569 and longitude 107.75079) were used to isolate total DNA following the protocol for the Illumina HiSeq 2500 platform (Illumina, San Diego, CA, USA) and the SMRTbell Libraries protocol for PacBio data (Pacific Biosciences, Menlo Park, CA, USA). All these whole-genome Illumina HiSeq and PacBio sequencing data were deposited in the NCBI GenBank (accession no. PRJNA562043) and the Persimmon Genome Website (http://www.kak iwi.zju.edu.cn/cgi-bin/persimmon/about_genome.cgi).
Sequencing reads of the mitochondria were filtered and extracted from these WGD sequencing data of D. oleifera. Raw data of second-generation sequencing were filtered using fastp version 0.20.0 software (https://github.com/Op enGene/fastp) [43]. The three-generation sequencing data of mitochondrial reads were error-corrected, trimmed, and de-novo-assembled using a Canu assembler (version 1.5) with default parameters [44]. Then, the contig sequence was obtained. The gene databases of plant mitochondria (the mitochondrial gene sequences of species published on the NCBI) were compared using blast v2.6 (https://blast.ncbi.nlm.nih.gov/Blast.cgi), and contigs that matched with the mitochondrial gene as the seed sequence were selected. The original data were used to extend and circularize the contigs to obtain the ring-dominant structure (or secondary ring), and then, the assembly was polished using NextPolish 1.3.1 (https://gith ub.com/Nextomics/NextPolish) [45]. The assembly results were calibrated using second-and third-generation data, and the parameters were set as rerun ¼ 3 and -max_depth ¼ 100. Then, the final assembly results were obtained.

Genome annotation
The assembled D. oleifera mt genome was annotated using the GeSeq tool [46]. To confirm the annotated results, the assembled D. oleifera mt genome was also BLAST-searched against protein-coding genes and ribosomal RNA (rRNA) genes of available plant mt genomes at the NCBI. Then, the sequence coordinates of the identified protein-coding genes (PCGs) were manually verified for start and stop codons. The annotations of transfer RNA (tRNA) genes were also confirmed by tRNAscan-SE [47]. ViennarNA-2.4.14 [48] was used to visualize the secondary structure of tRNA. The physical circular map was drawn using the Organellar Genome DRAW (OGDraw) v1.2 program [49]. The final annotated mt genome sequences of D. oleifera have been deposited in the NCBI Gen-Bank (accession no. MW970112).
Strand asymmetry was calculated according to the formulas: AT-skew [50]. The possible RNA-editing sites in the PCGs of D. oleifera were predicted using the online predictive RNA editor for plant mitochondrial genes (PREP-Mt) [51] suite of servers (http://prep.unl.edu/). The codon frequencies were calculated using the Codon Usage tool in the Sequence Manipulation Suite (bioinformatics.org/sms2/codon_usage.html) [52]. The relative synonymous codon usage (RSCU [53]) was calculated using the CAI Python package of Lee [54].

Chloroplast-to-mitochondrion-DNA transformation
The D. oleifera cp genome (NC_030787.1) was downloaded from the NCBI Organelle Genome Resources Database. The protein-coding and tRNA genes, which were transferred from chloroplasts to mitochondria, were identified using Blastn software with the following screening criteria: matching rate !70%, E-value 1e À 10, and length !30 bp.
The synonymous (Ks) and nonsynonymous (Ka) substitution rates of the protein-coding genes in the D. oleifera mt genome were analyzed using the 27 species. In this analysis, KaKs_Calculator (v2.0) [59] with the MLWL model was used to calculate Ka/Ks. Genome synteny and rearrangements among the using six representative species (Table 1) mitogenomes were analyzed using the progressive Mauve algorithm as implemented in Mauve ver. 2.4.0 software [60].
Previous studies have shown that rps10 is missing in the mt genomes of most plants, such as Arabidopsis thaliana, Brassica napus, and Beta vulgaris, and that its function is replaced by the nuclear gene [9]. However, the rps10 gene was found in the D. oleifera mt genome. The absence of rps2 and rps11 genes in the D. oleifera mt genome, consistent with R. simsii [64] and V. macrocarpon [3], supports Adams' speculation that rps2 and rps11 genes were lost in the early evolution of eukaryotic plants [3]. Similar to Nicotiana tabacum [68] and M. luteus [4], the D. oleifera mt genome has no rps1 gene, whereas rps1 is present in the V. macrocarpon mt genome [3] and two copies of rps1 are present in the R. simsii mt genome [64].
The persimmon mitochondria have 27 tRNAs (23 typical tRNA genes, one more trnN-GTT and three more trnM-CAT). The average length of  these tRNAs is 67-88 bp, with a total length of 1479 bp. The number of tRNAs in the D. oleifera mt genome is more than that in other asterids, such as V. macrocarpon (18) [3], R. simsii (23) [64], M. luteus (24) [4], and N. tabacum (21) [68]. This may be because some tRNAs in the D. oleifera mt genome have multiple copies; for example, trnN-GTT has two copies and trnM-CAT has four copies. The secondary structures are shown in   report [19,21,38,39,42], G-T (U) matches were also found in mostly tRNA secondary structures in the D. oleifera mt genome. The total gene length added up to 8% of the total mt genome length, with protein-coding regions comprising only 6.5% (32 kb) of the genome length. The gene content of D. oleifera is similar to that of the published mt genomes of asterids, especially Mimulus guttatus (7.4%) [4] and Helianthus annuus (8.5%) [71]. We found 54 genes with no introns, accounting for 78.26% of the total, consistent with the result conclusion that 63.2%-100% of mitochondrial genes in most plants have no introns [8,9]. In addition, 30 introns were found in the other 15 D. oleifera mt genes; nad1, nad2, nad5, and nad7 had 4 introns; nad4, 3 introns; and cox2, 2 introns.

Repeat sequences analysis
Simple sequence repeats (SSRs, or microsatellites) are DNA stretches consisting of short, tandem units of sequence repetitions 1-6 base pairs in length [72]. We identified 87 SSRs in the D. oleifera mt genome. The proportions of different repeat units are shown in Table 1. Tetranucleotide repeats were the most abundant SSR type, constituting 68.97% of all identified SSRs, and there were 7 SSRs in di-, tri-, and pentanucleotide repeats, accounting for 8.05% of all identified SSRs. There were only three mono-and hexanucleotide repeats in the D. oleifera mt genome. AAAG/CTTT motifs (16) were most recurrent motifs, representing 18.39% of all identified SSRs (Table S1).
Tandem repeats (satellite DNA) are core repeating units of 1-200 bases repeated several times in tandem [73]. As shown in Table 4, 12 tandem repeats 6 to 30 bp long were observed in the D. oleifera mt genome.
In addition, 760 non-tandem repeats, with 30 bp or more in length, were detected in the D. oleifera mt genome. Of the 760 non-tandem repeats, 426 were direct, 332 were palindromic, and 2 were reverse. The longest direct-type repeat was 115 bp long, while the longest inverted repeat was 331 bp long (Table S2). As shown in Figure 3, the 30-39 bp repeats were most abundant for both repeat types.
The repetitive sequence in the D. oleifera mt genome was 31 kb, accounting for 6.33% of the total mitochondria. This is considered a medium proportion of repeats, higher than that in Boea hygrometrica, (1.5%) and V. macrocarpon (3%) and lower than that in N. tabacum (13%) [68] and D. carota (16%) [66]. The different proportions of repeats may be because the mitochondria of B. hygrometrica, V. macrocarpon, and  D. oleifera are mainly short repeating units, whereas those of tobacco and carrots are mainly longer repeating units [66].

The prediction of RNA editing
The number of RNA-editing sites varies in different species and is usually frequent in angiosperm and gymnosperm mitochondria. We predicted 515 RNA-editing sites within all the 38 protein-coding genes ( Table 5) in the D. oleifera mt genome, which is more than those in A. thaliana (441) [5], Suaeda glauca (261) [73], Eucalyptus grandis (470) [74], and Citrullus lanatus (463) [75] and less than those in gymnosperms with larger mt genomes, such as Taxus cuspidata (974), Pinus taeda (1179), Cycas revoluta (1206), and Ginkgo biloba (1306) [32]. However, whether the number of RNA-editing sites is positively correlated with the size of the mt genome requires further research.
The selection of mitochondrial RNA-editing sites in D. oleifera shows a high degree of compositional bias. All RNA-editing sites are the C-T editing type, which is consistent with the fact that C-T is the most common editing type found in plant mt genomes [76,77,78]. In previous studies, almost half of the mitochondrial RNA editing occurred at the second codon position [73,77]. The proportion of RNA-editing sites at the second codon position in the D. oleifera mt genome is also about 45.72% (235), slightly less than that at the first codon position (259; 50.39%). However, no editing site was found at the third position of triplet codons, consistent with the fact that RNA-editing sites are rare in plant mt genomes [73,78].
Due to mitochondrial RNA editing, the D. oleifera mt genome has more RNA-editing sites but fewer editing types (Table 5). There were only 29 codon transfer types, corresponding to 14 amino acid transfer types, among the 515 RNA-editing sites. The types of transfer are comparable to those of most gymnosperms (30-40 codons; around 20 amino acids) [32,76] but less than those of monocotyledonous and dicotyledonous plants (50-60 codons; around 30 amino acids) [74,75,78]. Among the 29 codon transfer types, TCA ¼> TTA was the most common, with 78 sites. A leucine tendency after RNA editing, supported by the fact that 45.24% (233 sites) of the edits are converted to leucine, was found in the amino acids of predicted editing codons. After RNA editing, 43.59% of the amino acids remained hydrophobic. However, 47.57% of the amino acids were predicted to change from hydrophilic to hydrophobic, while 7.77% were predicted to change from hydrophobic to hydrophilic.
The number and type of RNA-editing sites differed among the mt genomes of D. oleifera and other species. Like with most angiosperms [73,76], ribosomal proteins (except rps4) and ATPase subunits (except atp6) had a relatively small number of RNA-editing-derived substitutions (1-12 sites), while the transcripts of NADH dehydrogenase subunits and cytochrome c biogenesis genes were significantly edited (11-36 sites; Figure 4), and ccmFn and ccmB had the most RNA-editing sites predicted (36,35).
In D. oleifera, 10,611 amino acids were encoded. The most frequently used amino acids were Leu (10.25%), Ser (9.23%), and Arg (6.86%), and the least common amino acids were Trp (1.52%) and Met (2.65%) (Figure 4). The relative synonymous codon usage (RSCU) value for D. oleifera for the third codon position is shown in Figure 5. Consistent with most of the currently studied mitochondrial genomes [1,73,76], the use of both two-and four-fold degenerate codons was biased toward the use of codons abundant in A or T.

Chloroplast-derived mitogenomic sequences
The transfer of DNA sequences among chloroplast and mt genomes has been frequently observed in the mt genomes of plants [79]. In many cases, the chloroplast DNA content in the mt genomes of most plants is 3%-6%, sometimes reaching up to about 10% [80]. The D. oleifera mt genome contained 28 chloroplast insertions, ranging in length from 32 to 5703 bp ( Figure 6, Table 6), with a total length of 32.83 kb, accounting for 6.65% of the total length of the genome, which is greater than the mitochondrial genome lengths of Liriodendron tulipifera (3%) [31], and N. tabacum (2.5%) [68]; comparable to those of C. lanatus (6%) [75], E. grandis (6%) [74], and Oryza sativa (6.3%) [81]; and less than those of Vitis vinifera (8.8%) [36] and Cucurbita pepo (11.5%) [75]. Among the transfer DNA sequences, some chloroplast protein-coding genes, such as atpB, atpE, rps12, rpoB, petA, psaA, and psbC, lost their integrity while migrating from the cp to the mitochondria, and only partial sequences of those cp-derived PCGs could be found in the D. oleifera mt genome (Table 6). In the D. oleifera mt genome, 11 chloroplast-derived tRNAs with a complete sequence were identified: trnA-UGC, trnD-GUC, trnI-GAU, trnM-CAU, trnN-GUU, trnP-UGG, trnR-ACG, trnS-GCU, trnS-GGA, trnV-GAC, and trnW-CCA. The different completeness levels of the transferred PCGs and tRNA genes showed that tRNA genes are much more conserved in the mt genome than PCGs, indicating that tRNA genes play an indispensable role in mitochondria. The transfer of these tRNAs can be traced back to the retention of an earlier horizontal gene transfer event. In accordance with the present results, cp-derived trnM-CAU first appeared in gymnosperms [82]; cp-derived trnD-GUC mainly appeared in dicotyledons, not in monocotyledons [76]; and cp-derived trnM-CAU and trnD-GUC were both found in the D. oleifera mt genome. However, the absence of cp-derived trnH-GTG, which is commonly found in angiosperms [3,74,76,82], and the presence of cp-derived trnA-UGC, lost during early evolution of terrestrial plants [80,83], indicate that special evolutionary events may be occurring during D. oleifera formation.

Phylogenetic, Ka/Ks and gene arrangement analysis
To detect the evolutionary status of the D. oleifera mt genome, a phylogenetic analysis was performed on D. oleifera, together with 27 other species: 23 eudicots (19 asterids and 4 rosids), 2 monocotyledons, and 2 gymnosperms (designated as outgroups). Phylogenetic relationships were analyzed using the concatenated dataset (28 PCGs: atp1, atp4, atp6, atp8, atp9, ccmB, ccmC, ccmFC, ccmFN, cob, cox1, cox2, cox3, matR, nad1, nad2, nad3, nad4, nad4L, nad5, nad6, nad7, nad9, rpl5, rps12, rps13, rps3 and rps4) through ML phylogenetic analysis. The abbreviations and accession numbers of the mt genomes investigated in this study are listed in Table S1. As outgroups, the two gymnosperms were distinct from the other angiosperms. The phylogenetic tree (Figure 7) strongly supported the separation of asterids from rosids and the separation of eudicots from monocots. Moreover, the taxa from 20 families (Apiaceae, Aquifoliaceae, Asteraceae, Brassicaceae, Campanulaceae, Convolvulaceae, Ericaceae, Ebenaceae, Fabaceae, Ginkgoaceae, Lamiaceae, Lentibulariaceae, Oleaceae, Phrymaceae, Pinaceae, Poaceae, Rosaceae, Rubiaceae, Solanaceae, and Vitaceae) were well clustered. In addition, the monophyly of D. oleifera, which belongs to the single genus of Diospyros in the Ebenaceae family, was well supported based on mt genomes (Figure 7). Consistent with previous comparative genome studies [23,28,29], this study also found that the clade united V. macrocarpon and R. simsii and then formed a sister cluster with the Ebenaceae family with high confidence (bootstrap value of 100%). In general, the phylogenetic tree topology was in line with the evolutionary relationships among those species, indicating the consistency of traditional taxonomy with the molecular classification.
To evaluate selective pressures during the evolutionary dynamics of protein-coding genes among closely related species, the nonsynonymous (Ka) and synonymous (Ks) substitution ratio (Ka/Ks) was calculated. For the Ka/Ks calculation, 28 PCGs from the D. oleifera mt genome were compared with the mt genomes of 27 species.
As shown in Figure 8, for the gene-specific substitution rates, Ka/Ks ranged from 0.031 at the cox1 gene in V. macrocarpon to 4.321 at the atp4 gene in D. carota. In 58 cases (except Glycine max, O. sativa, Platycodon grandiflorus, Scyphiphora hydrophyllacea and Z. mays), the Ka/Ks values of D. oleifera gene-specific substitution rates were higher than 1, compared with 22 other species, suggesting positive selection during evolution. Among the 22 species, nine substitution genes with higher Ka/Ks values were found between the D. oleifera and V. vinifera mt genomes and six genes between the D. oleifera and V. macrocarpon mt genomes. The atp4 and atp8 genes exhibited the highest average rate (1.348 and 0.751) and 15 and 5 Ka/Ks values above 1, respectively, suggested to be the result of positive or relaxed selection [2]. However, most genes had undergone negative selection pressures during evolution, supported by the fact that the Ka/Ks values of 654 proteins, accounting for 91.59% of the proteins in D. oleifera, were less than 1 compared to the other plant species. The atp1 and cox1 genes have the smallest average Ka/Ks values (0.212 and 0.272), indicating strong purifying selection [34,84]. These results show that mt genes are highly conserved during the evolutionary process in green plants.
Because of no mt genome of any species in Ebenaceae has been reported, Synteny of entire mitochondrial genomes was only compared among four Asterids (including three Ericales, one Apiales, and one Solanales) and one Rosids species in this study to assess the degree of structural rearrangement between different lineages. Figure 9 and Figs.1 showed that the rearrangement of mitochondrial genes has been widely occur among these six species, which is accords with many mitogenome observations [20,37,38,39,41,42]. When using D. oleifera as a reference genome, The dot-plot analyses showed sequences or synteny were seldom shared, and only short stretches of synteny among species (Figs.2). These Large rearrangement events have indicated differentiation within these six species mitogenome. Understandably, species that have close evolutionary relationships share more clusters [20,41,42], for example, In general, longer synteny sequences with higher similarity were found between D. oleifera and V. macrocarpon than that between D. oleifera and M. domestica.
Among genus Diospyros, The most well-known species is D. kaki, which has been cultivated as an important fruit crop, due to its edible fruit [25]. However, D. kaki are hexaploid (2n ¼ 6Â ¼ 90) or nonaploid (2n ¼ 9Â ¼ 135) and their origin, and polyploidization mechanisms are unclear, which has hampered genome sequencing and molecular breeding [23,28]. Phylogenetic analyses based on the nuclear [23,28]  and chloroplast [26,27] genome and mtDNA non-coding fragments [85] have indicated that D. oleifera is more closely related to D. kaki [24] and could be used as a model plant for studies of Diospyros [24,26]. So, as the nuclear and mt genome of hexaploid cultivated persimmon both remains unpublished, the availability of the D. oleifera mt genome provides more alternative comparable reference for D. kaki than D. lotus does. In addition, our results will lay the foundation for identifying further evolutionary relationships within Ebenaceae. However, due to the lack of  adequate representative mitogenomes, more Ebenaceae mitogenomes are needed to be sequenced to better resolve the phylogeny and evolutionary biology of Ebenaceae.

Conclusions
Here, we presented the first mitochondrial genome assembly and annotation of an Ebenaceae model plant Diospyros oleifera as well as the mitochondrial genome in the family Ebenaceae. The mitogenome was 493,958 bp in length, contained 39 protein-coding genes, 27 transfer RNA genes, and 3 ribosomal RNA genes. Comparative analysis of gene structure, codon usage, repeat regions and RNA-editing sites shows that rps2 and rps11 genes are missing, and a clear bias of RNA-editing sites is existing in the D. oleifera mt genome. In addition, the phenomenon that intracellular tRNA genes transferred frequently from chloroplasts to mitochondria was also observed in D. oleifera. Moreover, Phylogenetic analysis based on the mt genomes of D. oleifera and 27 other taxa indicates consistency in molecular and taxonomic classification. Furthermore, The Ka/Ks analysis based on code substitution revealed that most of the coding genes had undergone negative selections, indicating the conservation of mt genes during the evolution. These results will help in better understanding the features of the D. oleifera mitochondrial genome and lay the foundation for identifying further evolutionary relationships within Ebenaceae. However, due to the lack of adequate representative mitogenomes, more Ebenaceae mitogenomes are needed to be sequenced to better resolve the phylogeny and evolutionary biology of Ebenaceae.

Data availability statement
Data associated with this study [The final annotated mt genome sequences of D. oleifera] has been deposited at NCBI GenBank under the accession number MW970112.

Declaration of interests statement
The authors declare no conflict of interest.

Additional information
Supplementary content related to this article has been published online at https://doi.org/10.1016/j.heliyon.2022.e09870.